3 research outputs found

    PLoS One

    MOTIVATION: The recent revolution in sequencing technologies, part of a continuous process of adopting innovative protocols, has strongly affected how relations between phenotype and genotype are interpreted. Understanding the resulting gene sets has therefore become a bottleneck. Automatic methods have been proposed to facilitate the interpretation of gene sets. While statistical functional enrichment analyses are now well established, they tend to focus on well-known genes and to ignore new information from less-studied genes. Semantic similarity measures are a logical way to address these issues when the knowledge source used to annotate the gene sets is hierarchically structured. In this work, we propose a new method for analyzing the impact of different semantic similarity measures on gene set annotations. RESULTS: We evaluated the impact of each measure against two criteria for a "good" synthetic gene set annotation: (i) the number of annotation terms has to be drastically reduced while the representative terms are retained, and (ii) the number of genes described by the selected terms should be as large as possible. We analyzed nine semantic similarity measures to identify the best possible compromise between both criteria while maintaining a sufficient level of detail. Using Gene Ontology to annotate the gene sets, we obtained better results with node-based measures, which use the characteristics of the terms, than with edge-based measures, which use the links between terms. The annotations obtained with the node-based measures did not differ markedly, regardless of which term characteristics were used.
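    The two criteria in the abstract above can be read as two simple ratios. The following is only a minimal sketch of that reading, not the authors' implementation: the function name `annotation_summary`, the term labels `T1` to `T4`, and the gene labels are all hypothetical, and "term reduction" is assumed here to mean the fraction of original annotation terms that were discarded.

```python
from typing import Dict, Set


def annotation_summary(
    term_to_genes: Dict[str, Set[str]],
    selected_terms: Set[str],
    gene_set: Set[str],
) -> Dict[str, float]:
    """Score a candidate synthetic annotation on the two criteria.

    term_reduction: fraction of the original annotation terms that were dropped.
    gene_coverage:  fraction of the gene set described by at least one kept term.
    """
    if not term_to_genes or not gene_set:
        raise ValueError("need at least one annotation term and one gene")

    covered: Set[str] = set()
    for term in selected_terms:
        # Only count genes that actually belong to the gene set under study.
        covered |= term_to_genes.get(term, set()) & gene_set

    return {
        "term_reduction": 1 - len(selected_terms) / len(term_to_genes),
        "gene_coverage": len(covered) / len(gene_set),
    }


# Hypothetical toy annotation: four terms annotating a five-gene set.
term_to_genes = {
    "T1": {"g1", "g2", "g3"},
    "T2": {"g2", "g3"},
    "T3": {"g4"},
    "T4": {"g1"},
}
gene_set = {"g1", "g2", "g3", "g4", "g5"}

# Keep two representative terms out of four.
summary = annotation_summary(term_to_genes, {"T1", "T3"}, gene_set)
print(summary)
```

    In this toy case, half of the terms are dropped while four of the five genes remain described; the gene `g5` has no annotation at all and can never be covered, which is why coverage is a ratio to maximize rather than a target of 100%.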

    Development of new computational methods for a synthetic gene set annotation

    The revolution in sequencing technologies, by strongly improving the production of omics data, is leading to new understandings of the relations between genotype and phenotype. To interpret and analyze data grouped according to a phenotype of interest, methods based on statistical enrichment have become a standard in biology. However, these methods synthesize the biological information by selecting the over-represented terms a priori, and they focus on the most studied genes, which may cover only a limited share of the annotated genes within a gene set. They also generate redundant information that can be eliminated by taking into account the structure of the knowledge resources that provide the annotation. During this thesis, we explored different methods for annotating gene sets, through three studies that improve the understanding of the biological context of gene sets.
    First, visualization approaches were applied to represent annotation results provided by enrichment analysis for a gene set or a repertoire of gene sets. A visualization prototype called MOTVIS (MOdular Term VISualization) was developed to provide an interactive representation of a repertoire of gene sets. It combines two interconnected visual metaphors: a treemap view that provides an overview and also displays detailed information about gene sets, and an indented tree view that can be used to focus on the annotation terms of interest. MOTVIS overcomes the limitations that each visual metaphor has when used on its own. This illustrates the value of combining different visual metaphors to make biological results easier to understand when complex data are represented.
    Second, to address the limitations of enrichment analysis, a new method was proposed for analyzing the impact of different semantic similarity measures on gene set annotation. To evaluate the impact of each measure, two criteria were considered for characterizing a "good" synthetic gene set annotation: (i) the number of annotation terms has to be drastically reduced while maintaining a sufficient level of detail, and (ii) the number of genes described by the selected terms should be as large as possible. Nine semantic similarity measures were analyzed to identify the best possible compromise between both criteria. Using Gene Ontology (GO) to annotate the gene sets, we observed better results with node-based measures, which use the characteristics of the terms, than with edge-based measures, which use the relations between terms. The annotations obtained with the node-based measures did not differ markedly, regardless of which term characteristics were used. We then developed GSAn (Gene Set Annotation), a gene set annotation web server that uses semantic similarity measures to synthesize GO annotation terms a priori. GSAn includes the interactive visualization MOTVIS to present the representative terms together with the genes of the set under study. Compared to enrichment analysis tools, GSAn offered a good compromise between maximizing gene coverage and minimizing the number of terms.
    Finally, the third study enriched the annotation results provided by GSAn. Since the knowledge described in GO may not be sufficient for interpreting gene sets, other biological information, such as pathways and diseases, may provide a wider biological context. Two additional knowledge resources, Reactome and Disease Ontology (DO), were therefore integrated within GSAn. Integrating these resources improved gene coverage without significantly affecting the number of terms involved. GO terms were mapped to terms of Reactome and DO both before and after applying the GSAn method, using two strategies to find mappings (generated or extracted from the web) between each new resource and GO. A mapping applied before computing the GSAn method yielded a larger number of inter-relations between the two knowledge resources.
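    The distinction between node-based and edge-based measures can be illustrated on a toy ontology. The abstract does not name the nine measures that were compared, so the sketch below substitutes two textbook examples as an assumption: Resnik similarity (node-based, using the information content of the most informative common ancestor) and an inverse shortest-path similarity (edge-based). The ontology, the term names, and the annotation counts are all hypothetical.

```python
import math
from typing import Dict, List

# Hypothetical toy ontology: each term maps to its direct parents (a DAG).
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "metabolism": ["root"],
    "signaling": ["root"],
    "lipid_metabolism": ["metabolism"],
    "sugar_metabolism": ["metabolism"],
}

# Hypothetical counts of genes annotated to each term or to its descendants.
GENE_COUNTS: Dict[str, int] = {
    "root": 100,
    "metabolism": 40,
    "signaling": 60,
    "lipid_metabolism": 10,
    "sugar_metabolism": 20,
}


def ancestor_distances(term: str) -> Dict[str, int]:
    """Map every ancestor of `term` (and the term itself) to its minimal edge distance."""
    dist = {term: 0}
    frontier = [term]
    while frontier:  # breadth-first walk up the DAG
        next_frontier = []
        for current in frontier:
            for parent in PARENTS[current]:
                if parent not in dist:
                    dist[parent] = dist[current] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return dist


def information_content(term: str) -> float:
    """IC = -log p(term), written as log(total / count) so the root gets exactly 0.0."""
    return math.log(GENE_COUNTS["root"] / GENE_COUNTS[term])


def resnik_similarity(a: str, b: str) -> float:
    """Node-based: information content of the most informative common ancestor."""
    common = ancestor_distances(a).keys() & ancestor_distances(b).keys()
    return max(information_content(term) for term in common)


def edge_similarity(a: str, b: str) -> float:
    """Edge-based: 1 / (1 + length of the shortest path through a common ancestor)."""
    dist_a, dist_b = ancestor_distances(a), ancestor_distances(b)
    shortest = min(dist_a[t] + dist_b[t] for t in dist_a.keys() & dist_b.keys())
    return 1 / (1 + shortest)


for pair in [("lipid_metabolism", "sugar_metabolism"), ("lipid_metabolism", "signaling")]:
    print(pair, round(resnik_similarity(*pair), 3), round(edge_similarity(*pair), 3))
```

    The edge-based score depends only on the shape of the graph, whereas the node-based score changes with how many genes are annotated to each term, which is one sense in which node-based measures "use the characteristics of the terms".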

    Développement de nouvelles méthodes informatiques pour une annotation synthétique d'un ensemble de gènes.
